Introduction


What is Ethnic Density?

Ethnic density is defined as the composition of each ethnic group residing in a geographical area of a given size (usually a Lower Super Output Area (LSOA), a census area which consists of around 1500 residents).

Here’s an image of London showing the ethnic density or ethnic composition of the city:

Ethnic Density of London


In the image above, the White British ethnic density is very high (indicated by the dominant dark green colour) across the city but especially in the outskirts of central London. There are pockets of high Asian ethnic density (dark blue), in the East and West of London. Non-British White ethnic groups (yellow) and Black ethnic groups (pink) have a reasonably high ethnic density in and around central London.

The ethnic densities vary across London, and different ethnic groups are dominant across different parts of London.


Measuring ethnic density using the ethnic density score

Based on where they live and their ethnicity, every individual can be assigned an ethnic density score. This score is simply the ethnic density of their own ethnic group (the own-group ethnic density) in the area they live in.

Calculating ethnic density score:

  A residential area - "Area A" - has a total of 1500 residents.
  
  The ethnic composition in Area A is: 
      500 residents of Indian descent, 
      250 residents of British descent and 
      750 residents of African descent.
      
  The Indian ethnic density would be 500 divided by the total number of 
  residents, 1500 (0.33). 
  That is, 33% of all individuals in this area are of Indian descent.
  
  The British ethnic density would be 250 / 1500 (0.17). 
  
  The African ethnic density would be 750 / 1500 (0.50). 
  
  For someone of Chinese descent who moves into Area A, their ethnic density 
  score would be 1 divided by 1501 (about 0.0007, i.e. close to 0), which 
  indicates they live in an area of low own-group ethnic density. 
  
  Someone of African descent would have an ethnic density score of 50%. This 
  person is living in an area of high own-group ethnic density (because there 
  are more residents of this ethnic group in Area A than of any other ethnic 
  group). 
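
The Area A arithmetic can be reproduced in a few lines of R. This is only a sketch: the vector `area_a` and the helper `own_group_score()` are made up for the worked example and are not part of the project data or code.

```r
# Hypothetical resident counts for "Area A" (taken from the worked example)
area_a <- c(Indian = 500, British = 250, African = 750)

# Ethnic density = residents in a group / all residents in the area
ethnic_density <- area_a / sum(area_a)
round(ethnic_density, 3)

# An individual's score is the density of their own group in their area;
# a group with no recorded residents returns NA in this simplified version
own_group_score <- function(ethnicity, density) {
  if (ethnicity %in% names(density)) density[[ethnicity]] else NA_real_
}

own_group_score("African", ethnic_density)
```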

A person's ethnic density score therefore indicates the type of area they live in, in terms of its ethnic composition: a high ethnic density area, where a larger share of residents are of their own ethnicity, or a low ethnic density area, where most residents are of another ethnicity.


Why is Ethnic Density important?


Some studies report that, in multicultural cities, ethnic minorities living in areas with higher proportions of ethnic minority residents may be better off (but in some cases worse off) in terms of their mental and physical health than ethnic minority groups living in areas with larger proportions of the host ethnicity. This beneficial effect on health, arising from the ethnic composition of the residential area, is known as the ethnic density effect.

The figure below is an example suggesting that reported levels of psychotic symptoms on relevant measures decrease among individuals living in areas of higher own-group ethnic density.

The Ethnic Density effect and Levels of Reporting Psychotic symptoms among White British women, Du Preez et al, 2016


Another example (figure below) shows the ethnic density effect in play in the majority of the ethnic groups presented. There seems to be a reverse effect in the White British ethnic group.

Differing effects of Ethnic Density in Different Ethnicities, Das Munshi et al, 2012

As can be seen, the ethnic density effect may not manifest consistently among all ethnic groups, but there is evidence of a protective effect against adverse mental health outcomes.


Studies demonstrating the positive Ethnic Density Effect on Mental Health

This ethnic density “effect” was first reported in a 1939 study by Faris and Dunham. Their study, based in Chicago, showed that White people living in areas where Black ethnic groups were predominant had a higher rate of schizophrenia (137.4 cases per 100,000) than the Black residents (39.4 cases per 100,000), where the overall area prevalence rate for schizophrenia was 50.4 cases per 100,000. In another study, Halpern and Nazroo used a nationwide community survey in England and Wales to explore the association between ethnic density and reported levels of psychiatric symptoms. They showed a negative correlation of own-group ethnic density with neurotic symptoms, such as fatigue, sleep, depression and anxiety (r = -0.087). That is, with an increase in ethnic density, there is a decrease in the levels of neurotic symptoms. Similarly, they found a negative association of ethnic density with psychotic symptoms (r = -0.113).

Studies demonstrating a complex mechanism of the effect

Studies of ethnic density (own-group or combined ethnic minority) and physical and mental health have demonstrated positive effects on health outcomes, but also detrimental effects in some ethnic groups and not others. For example, among Black groups this association is largely reversed, with the risk of premature and all-cause mortality increasing as Black ethnic density increases. The mechanism of the ethnic density effect is complex and requires a deeper understanding of ethnic groups and cultures.

Ethnic Density Effect and Suicidality

There is some evidence of the ethnic density effect being protective, for ethnic minority groups in the community, against suicide-related behaviours. In 2012, a review was published summarising the effect of ethnic density on mental health outcomes, which included suicide-related behaviours (2 studies) [Shaw et al, 2012]. Both studies found reduced risk of self-harm behaviour and completed suicide among ethnic minority groups with increasing ethnic density.

In one study, the rates of A&E attendance for self-harm were compared among White, African-Caribbean and Asian groups. The authors found that, as ethnic minority density increased, the self-harm referral rates of ethnic minorities fell relative to the White self-harm referral rate, with a risk ratio (RR) of 1.24 (95% CI: 0.69 – 2.10) in lower ethnic minority density areas versus an RR of 0.61 (0.47 – 0.79) in higher ethnic minority density areas.

In the other study, Neeleman used coroner’s records of completed suicides to determine subjects’ ethnic background and to generate White and non-White ethnic density for each subject. The study found that, as ethnic minority density increased, suicide rates were higher among the White ethnic group, with an RR of 1.18 (1.02 – 1.37), and lower among ethnic minority groups, with an RR of 0.75 (0.59 – 0.96).


Aim


Individuals diagnosed with certain mental disorders seen in secondary mental health care have a particularly high risk of suicide mortality compared to the general population. Whether the ethnic density effect has any impact on this risk is not clear.

The aim is to determine if there is an association of ethnic density with completed suicide in this secondary mental health care setting. In other words, this project will aim to study whether living in an area of high or low ethnic density (i.e. surrounded more by people of the same ethnicity or not) has any effect on completed suicide, in mental illness.


The Data


The data are derived from a mental health clinical trust in South London, which provides mental healthcare for an area with a population of around 1.4 million residents. The trust treats individuals with mild to severe mental health problems who are referred by GPs, privately referred, referred through A&E, or self-referred.

The trust uses an electronic system to record day-to-day patient interactions (medical, demographic, clinical intervention etc.) in either structured notes or free-text fields.

In 2008, a research facility was founded which uses a pseudonymised version of this electronic system from the South London trust for research and clinical audit purposes. Currently there are ~270000 records in this research database. For this project, a subset of patients, and related variables, was extracted based on inclusion criteria (see below) to create the dataset.


The Cohort

The dataset consists of 47581 patients.

The patients in the dataset were included if they met the following inclusion criteria:

  1. They had an active referral (in the form of face to face contact) at any point between the observation window of 1st of January 2008 and 31st of December 2014.

  2. They had a clinical diagnosis of depression, schizophrenia, schizoaffective disorder, bipolar disorder, manic disorder or alcohol abuse. For patients with multiple diagnoses, the date of diagnosis closest to the observation start date was selected.

  3. They had an area-level address (LSOA code) recorded (to merge with census data). For patients with multiple area-level addresses, the closest address to the date of diagnosis was selected.

  4. They had a known ethnicity recorded (each patient’s ethnicity and the ethnic composition of their LSOA were used to assign an ethnic density score).
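
As a rough illustration of how the four criteria combine, the sketch below filters a toy patient table. Every column name and value here is hypothetical; the real extraction was run against the research database and is not shown in this report.

```r
# Toy patient table; all column names and values are hypothetical
patients <- data.frame(
  id            = 1:4,
  contact_date  = as.Date(c("2007-05-01", "2010-03-15", "2013-11-30", "2015-02-01")),
  has_diagnosis = c(TRUE, TRUE, FALSE, TRUE),
  lsoa_code     = c("E01000001", "E01000002", "E01000003", NA),
  ethnicity     = c("British (A)", "Indian (H)", "", "African (N)"),
  stringsAsFactors = FALSE
)

window_start <- as.Date("2008-01-01")
window_end   <- as.Date("2014-12-31")

# Keep only patients who satisfy all four inclusion criteria
cohort <- subset(
  patients,
  contact_date >= window_start & contact_date <= window_end &  # criterion 1
    has_diagnosis &                                            # criterion 2
    !is.na(lsoa_code) &                                        # criterion 3
    !is.na(ethnicity) & ethnicity != ""                        # criterion 4
)
cohort$id
```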

Each individual in the cohort is diagnosed with one or more of the disorders mentioned in the table below.

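| Diagnosis       | Diagnosed |     N | Number of Suicides |
|:----------------|:----------|------:|-------------------:|
| Schizophrenia   | No        | 38091 |                190 |
|                 | Yes       |  9438 |                 72 |
| Schizoaffective | No        | 46199 |                252 |
|                 | Yes       |  1330 |                 10 |
| Bipolar         | No        | 42197 |                216 |
|                 | Yes       |  5332 |                 46 |
| Substance Abuse | No        | 30030 |                169 |
|                 | Yes       | 17499 |                 93 |
| Depressive      | No        | 27023 |                152 |
|                 | Yes       | 20506 |                110 |
| Manic Disorder  | No        | 45487 |                251 |
|                 | Yes       |  2042 |                 11 |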

Data notes

The original data is loaded in R and is named ed. What follows is the process by which ed is cleaned. After cleaning and feature engineering, the resulting dataset is named edclean.

edclean is the dataset to load and use for feature engineering, Exploratory Data Analysis (EDA) and final analysis.


Cleaning the Data


Structure of the ed dataset


At first glance the str(ed) output (results not shown) indicates:

  • there are 47581 observations of 67 variables,
  • lots of NA values,
  • inexplicable column headings,
  • redundant variables,
  • potentially duplicate variables,
  • several continuous and categorical variables that may or may not contain the same data or that need to be manipulated to be more useful, and
  • not all of the 67 variables are informative or useful.

Missing value map on the ed dataset

The Figure above displays all the variables in the “ed” dataset by each row. The missing values are color coded (red for values, non-red for missing values).

  • There are columns with no values (columns D and C) and redundant values (“JunkID”).

  • Diagnosis and diagnosis date columns contain more than 50% NA values. The NA values indicate patients who are not diagnosed with the particular disorder. NA will be replaced with 0.

  • Some variables have names that are not decipherable and need renaming (e.g. “AL” = OtherBlack_EDPercent). ID mappings are available from the original Stata database (not shown).

  • The exposure variable “ethnic density score” will need to be generated using the available columns in the dataset.
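
The same information that the missing value map gives visually can be tabulated per column. The sketch below uses a toy data frame standing in for `ed` (the columns and values are invented) and counts NA values and blank strings separately, since both occur in this dataset.

```r
# Toy data frame standing in for `ed`; columns and values are illustrative only
toy <- data.frame(
  Gender  = c("Female", "", "Male", "Male"),
  Marital = c("Single", "", "", "Married / Cohabiting"),
  Diag    = c(1, NA, NA, 1),
  stringsAsFactors = FALSE
)

# For each column, count NA values and blank ("") strings separately
missing_summary <- data.frame(
  variable = names(toy),
  n_na     = sapply(toy, function(x) sum(is.na(x))),
  n_blank  = sapply(toy, function(x) sum(!is.na(x) & as.character(x) == "")),
  row.names = NULL
)
missing_summary
```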


The table below lists what needs to be cleaned in each variable, based on the output of the table() and table(is.na()) functions and on the missing value map.

| Variable | Notes |
|:---------|:------|
| JunkID | Can be deleted, redundant |
| Gender_Cleaned | There are 2 “empty” Gender cells. Will replace them with NA |
| DOB_Cleaned | One NA value in row 45501, not sure what to do with it, will leave until further analysis |
| Marital_Cleaned | 3803 blank values, no NA values. Blank values recoded as “Unknown” |
| primary_diagnosis | Can be deleted. Contains many different diagnoses in unstructured and structured format, and is potentially redundant because other variables flag the main diagnoses |
| ethnicitycleaned | Is fine, no NA or blank values |
| imd_score | 49 NA values, not sure what to do here. Will leave it in until further analysis |
| diagnosis_date to Bipolar_Diag_Date | These variables are dates of the main diagnoses and binary flag variables for diagnoses. NA in these variables means the patient does not have the disorder, so it is better to change it to 0 |
| ons_date_of_death | 5310 deaths in the cohort |
| Suicide | 263 suicides in the cohort |
| ICD10_UnderlyingCause | Redundant variable, delete |
| LSOAClosestToDiagnosis | LSOA area-level address code |
| LSOA11 | Redundant, delete |
| lsoa01 | Redundant, delete |
| LSOA_NAME | Name of boroughs |
| C and D | Are empty, delete |
| All_Usual_Residents to AP | Variable names are not informative and need to be renamed. These variables are ethnic densities (number and percentage of people of different ethnic groups in the corresponding LSOA code) and will be needed to generate the main exposure variable “ethnicdensityscore” |

Below is code to clean data based on table and map above


Cleaning the Gender variable

table(ed$Gender_Cleaned) 
table(is.na(ed$Gender_Cleaned)) # no NA values, 2 blank ("") values. 

#       Female   Male 
#     2  23106  24473 

#recoding blank gender values as NA
ed$Gender_Cleaned[ed$Gender_Cleaned == ""] <- NA
table(ed$Gender_Cleaned) 
table(is.na(ed$Gender_Cleaned))

#identify where gender NA rows are
which(is.na(ed$Gender_Cleaned)) # row 682 and 30844

#removing samples from original data "ed"
# overwriting ed with new ed.
# (logical indexing is used because ed[-which(...), ] would drop every row
#  if there were no NA values to remove)
ed <- ed[!is.na(ed$Gender_Cleaned), ]

table(is.na(ed$Gender_Cleaned))
table(ed$Gender_Cleaned)
#       Female   Male 
#        23106  24473

Cleaning the Marital Status variable

table(ed$Marital_Cleaned) # 3803 blank ("") values.
table(is.na(ed$Marital_Cleaned)) # no NA values

# replacing blank values in Marital_Cleaned variable with "Unknown"
ed$Marital_Cleaned[ed$Marital_Cleaned == ""] <- "Unknown"

table(ed$Marital_Cleaned) # New category name is "Unknown" = 3803

Cleaning the Diagnoses columns

# replacing NAs with 0 in the diagnosis columns
ed$Schizophrenia_Diag[is.na(ed$Schizophrenia_Diag)] <- 0
ed$SchizoAffective_Diag[is.na(ed$SchizoAffective_Diag)] <- 0
ed$Bipolar_Diag[is.na(ed$Bipolar_Diag)] <- 0
ed$Depressive_Diag[is.na(ed$Depressive_Diag)] <- 0
ed$Manic_Diag[is.na(ed$Manic_Diag)] <- 0
ed$SubAbuse_Diag[is.na(ed$SubAbuse_Diag)] <- 0
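
The six near-identical assignments above could also be written as a loop over the column names. The following is a minimal sketch on a toy data frame (only two of the flag columns are reproduced, with invented values), not the code that was actually run.

```r
# Toy stand-in for `ed`, using two of the diagnosis flag columns
toy <- data.frame(
  Schizophrenia_Diag = c(1, NA, NA),
  Bipolar_Diag       = c(NA, 1, NA)
)

diag_cols <- c("Schizophrenia_Diag", "Bipolar_Diag")

# Replace NA with 0 in every listed column
for (col in diag_cols) {
  toy[[col]][is.na(toy[[col]])] <- 0
}
toy
```

Listing the columns once in `diag_cols` makes it harder to miss a flag column when a new diagnosis is added.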

Renaming and deleting redundant variables

# renaming columns (rename(), select() and %>% are provided by dplyr)
library(dplyr)

ed <- ed %>% rename(TotalResidentsInLSOA = All_Usual_Residents,
                       WhiteBrit_EDPercent  = G, 
                       WhiteIrish_EDPercent = White_Irish_Percentage, 
                       OtherWhite_EDPercent = White_Other_White_GypsyIrishTrav, 
                       WhiteBlackCarib_EDPercent = P, 
                       WhiteBlackAfri_EDPercent = R, 
                       WhiteAsian_EDPercent = T, 
                       OtherMixed_EDPercent = V, 
                       BritIndian_EDPercent = Asian_Asian_British_Indian_Perce, 
                       BritPakistani_EDPercent = Asian_Asian_British_Pakistani_Pe, 
                       BritBangladeshi_EDPercent = AB, 
                       BritChinese_EDPercent = Asian_Asian_British_Chinese_Perc,
                       OtherAsian_EDPercent = Asian_Asian_British_OtherAsian_P, 
                       African_EDPercent = AH, 
                       Caribbean_EDPercent =  AJ, 
                       OtherBlack_EDPercent = AL, 
                       WhiteBrit_Residents = White_English_Welsh_Scottish_Nor, 
                       TotalIrish_Residents = White_Irish_Count, 
                       OtherWhite_Residents = White_Gypsy_Irish_Traveller_Coun, 
                       MixedCaribbean_Residents = Mixed_Multiple_Ethnic_Groups_Whi, 
                       MixedAsian_Residents = S, 
                       OtherMixed_Residents = Mixed_Multiple_Ethnic_Groups_Oth, 
                       BritIndian_Residents = Asian_Asian_British_Indian_Count, 
                       BritPakistani_Residents = Asian_Asian_British_Pakistani_Co, 
                       BritBangladeshi_Residents = Asian_Asian_British_Bangladeshi_, 
                       BritChinese_Residents = Asian_Asian_British_Chinese_Coun, 
                       OtherAsian_Residents = Asian_Asian_British_OtherAsian_C, 
                       African_Residents = Black_African_Caribbean_BlackBr, 
                       Caribbean_Residents = Black_African_Caribbean_Black_Br, 
                       OtherBlack_Residents = AK, 
                       OtherEthnicity_Residents = Other_Ethnic_Group_AnyOtherEthni) 
#***************************************************************************
# Columns to delete

# C and D  
# lsoa01 
# ICD10_UnderlyingCause 
# primary_diagnosis
# JunkID    

dim(ed) # 47579 67

ed <- ed %>% select(-C, -D, -lsoa01, -ICD10_UnderlyingCause, -primary_diagnosis, -JunkID)

dim(ed) # 47579 61

Cleaned data saved as “Cleaned_Data_ED.Rdata”. This contains dataframe ed.

# save ed 
save(ed, file = "Cleaned_Data_ED.Rdata", compress = TRUE)

Feature Engineering


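A new dataset called edclean is created, which is a copy of the cleaned dataset ed with new features added.

List of new features added:

  • ethnicdensityscore (ethnic density score)
  • ethnicity (grouped ethnicity)
  • ageatdiagnosis, ageatdeath and agegroups (age variables)
  • LSOA_4boroughs (borough)
  • DeathBy (cause of death)

Below is the code used to generate the listed new features.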


Code generating the Ethnic Density Score (named “ethnicdensityscore”)

  - The main exposure variable of the dataset is the Ethnic Density (ED) score. 
    Ethnic density is defined as the composition of each ethnic group residing 
    in a geographical area of a given size (usually a Lower Super Output Area 
    (LSOA), which consists of around 1500 residents).
  
  - In the original dataset `ed`, each patient was already assigned the ethnic
    density of every ethnic group based on their LSOA code. To assign each
    individual their OWN-ethnicity ethnic density score, the ethnic density
    column matching the patient's ethnicity was selected.
    
# Code generating an ethnic density score for each patient

edclean <- ed %>% 
  mutate(ethnicdensityscore = 
          ifelse(ethnicitycleaned == "British (A)", WhiteBrit_EDPercent, 
          ifelse(ethnicitycleaned == "African (N)", African_EDPercent, 
          ifelse(ethnicitycleaned == "Irish (B)", WhiteIrish_EDPercent, 
          ifelse(ethnicitycleaned == "Any other Asian background (L)", OtherAsian_EDPercent, 
          ifelse(ethnicitycleaned == "Any other black background (P)", OtherBlack_EDPercent,
          ifelse(ethnicitycleaned == "Any other mixed background (G)", OtherMixed_EDPercent,
          ifelse(ethnicitycleaned == "Any other white background (C)", OtherWhite_EDPercent,
          ifelse(ethnicitycleaned == "Bangladeshi (K)", BritBangladeshi_EDPercent,
          ifelse(ethnicitycleaned == "Caribbean (M)", Caribbean_EDPercent,
          ifelse(ethnicitycleaned == "Chinese (R)", BritChinese_EDPercent,
          ifelse(ethnicitycleaned == "Indian (H)", BritIndian_EDPercent,
          ifelse(ethnicitycleaned == "Pakistani (J)", BritPakistani_EDPercent,
          ifelse(ethnicitycleaned == "White and Asian (F)", WhiteAsian_EDPercent,
          ifelse(ethnicitycleaned == "White and Black African (E)", WhiteBlackAfri_EDPercent,
          ifelse(ethnicitycleaned == "White and black Caribbean (D)", WhiteBlackCarib_EDPercent,  
          NA  # unmatched ethnicities get a missing score instead of a blank string
          )))))))))))))))) %>% 
  mutate(ethnicdensityscore = as.numeric(ethnicdensityscore)) 

# save ed with new feature into new dataset "edclean"
save(edclean, file = "Data_ED_new_features.Rdata", compress = TRUE)
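
An alternative to the long chain of nested ifelse() calls is a lookup from ethnicity label to column name. This is a sketch of that idea on a toy data frame with invented values and only three of the fifteen ethnicities; the names `ed_column`, `col_name` and `density` are hypothetical helpers, and this is not the code used to build edclean.

```r
# Toy stand-in for `ed`; values are invented for illustration
toy <- data.frame(
  ethnicitycleaned     = c("British (A)", "African (N)", "Indian (H)", "Not stated"),
  WhiteBrit_EDPercent  = c(46.9, 30.2, 55.0, 40.0),
  African_EDPercent    = c(10.1, 20.5, 3.3, 8.0),
  BritIndian_EDPercent = c(2.0, 1.5, 12.4, 3.0),
  stringsAsFactors = FALSE
)

# Named lookup: ethnicity label -> column holding that group's density
ed_column <- c(
  "British (A)" = "WhiteBrit_EDPercent",
  "African (N)" = "African_EDPercent",
  "Indian (H)"  = "BritIndian_EDPercent"
)

# Column to read for each patient (NA when the ethnicity is not in the lookup)
col_name <- unname(ed_column[toy$ethnicitycleaned])

# Pick one cell per row using (row, column) matrix indexing
density <- as.matrix(toy[, ed_column])
toy$ethnicdensityscore <- density[cbind(seq_len(nrow(toy)),
                                        match(col_name, colnames(density)))]
toy$ethnicdensityscore
```

With this layout, adding or correcting an ethnicity only means editing one entry in `ed_column`, and unmatched ethnicities fall through to NA.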

Code generating the Ethnicity variable (named “ethnicity”)

  Some of the ethnic groups are too small in numbers, and their respective 
  suicide numbers are correspondingly small (data not shown), which makes them 
  difficult to analyse and could potentially bias estimates towards ethnic 
  groups that have larger sizes. These groups will be aggregated while still 
  maintaining their own-ethnic group ethnic density scores. 
  
  Ethnic groups are grouped into larger ethnic categories (White, Other White,
  Irish, Black, Other Black, Caribbean, Asian and Mixed Race).
  
  See chunk "feature_engineering_ethnicity" in Capstone_Project_Draft.Rmd 
  for code.
  

Generating Age variables (named “ageatdiagnosis”, “ageatdeath” and “agegroups”)

  The dataset only has pseudonymised date of birth variables
  and age can be generated from them using date of diagnosis.
  
  See Chunk "feature_engineering_create_age_variables" in 
  Capstone_Project_Draft.Rmd for code.
  
  Summaries and counts provided below. 
# summary(edclean$ageatdiagnosis)

#  Age at Diagnosis Summary
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.00   30.00   40.00   42.57   52.00  104.00 

# table(edclean$agegroups)

# Patient counts by age groups
#  < 25  26-40  41-60 61-100 
#  7726  16075  16532   7196 

# summary(edclean$ageatdeath)

# Age at Death Summary
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#  10.00   52.00   69.00   66.49   82.00  107.00   42222
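
Since the original chunk is not reproduced here, the following is only a sketch of how age in completed years and age bands could be derived from two date columns. The dates are invented, `age_in_years()` is a hypothetical helper, and the band breakpoints are an assumption chosen to resemble the labels in the counts above.

```r
# Toy dates; the real DOB and diagnosis dates are pseudonymised
dob       <- as.Date(c("1990-06-15", "1950-01-01", "1975-12-31"))
diag_date <- as.Date(c("2010-06-14", "2012-03-01", "2008-12-31"))

# Age in completed years: year difference, minus one if the birthday
# had not yet been reached in the later year
age_in_years <- function(from, to) {
  years <- as.integer(format(to, "%Y")) - as.integer(format(from, "%Y"))
  before_birthday <- format(to, "%m%d") < format(from, "%m%d")
  years - as.integer(before_birthday)
}

ageatdiagnosis <- age_in_years(dob, diag_date)

# Band the ages (assumed breakpoints: up to 25, 26-40, 41-60, over 60)
agegroups <- cut(ageatdiagnosis,
                 breaks = c(-Inf, 25, 40, 60, Inf),
                 labels = c("< 25", "26-40", "41-60", "61-100"))
data.frame(ageatdiagnosis, agegroups)
```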

Code for generating Borough variable (named “LSOA_4boroughs”)

# tabulating patient LSOA code
# these need to be grouped into the main boroughs for this particular Trust
# table(edclean$LSOA_NAME)


# generating borough variable ("LSOA_4boroughs")
edclean <- edclean %>% 
  mutate (LSOA_4boroughs = 
            ifelse(grepl("^Southw", edclean$LSOA_NAME) %in% TRUE, "SOUTHWARK", 
            ifelse(grepl("^Croy", edclean$LSOA_NAME) %in% TRUE, "CROYDON",
            ifelse(grepl("^Lambe", edclean$LSOA_NAME) %in% TRUE, "LAMBETH",
            ifelse(grepl("^Lewish", edclean$LSOA_NAME) %in% TRUE, "LEWSIHAM", 
                   "OTHER")))))

x <- as.data.frame(table(edclean$LSOA_4boroughs))
kable(x)
# |Var1      |  Freq|
# |:---------|-----:|
# |CROYDON   | 10419|
# |LAMBETH   |  9979|
# |LEWSIHAM  |  8135|
# |OTHER     |  9941|
# |SOUTHWARK |  9104|

table(is.na(edclean$LSOA_4boroughs)) # no NA values

# save new feature into new dataset "edclean"
save(edclean, file = "Data_ED_new_features.Rdata", compress = TRUE)

Generating Cause of Death Variable (named “DeathBy”)

  - Create flag variables for individuals:
    i) who died by suicide
    ii) who died of other causes
    iii) who are not dead
  
  - See code feature_engineering_cause_of_death in 
    Capstone_Project_Draft.Rmd for code.
    
  - Count summary provided below. 
  
table(edclean$DeathBy)

#   NotDied    OtherCause    Suicide 
#     42222          5045        262
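
The chunk itself is not shown, so the following is only a sketch of one way a three-level cause-of-death variable could be built. It assumes a 0/1 Suicide flag and a date-of-death column that is NA for patients who have not died; the toy values are invented.

```r
# Toy stand-in for `edclean`; both columns are simplified for illustration
toy <- data.frame(
  Suicide     = c(0, 1, 0, 0),
  dateofdeath = as.Date(c(NA, "2012-04-01", "2013-09-20", NA))
)

# Suicide takes priority; otherwise a recorded date of death means another cause
toy$DeathBy <- ifelse(toy$Suicide == 1, "Suicide",
               ifelse(!is.na(toy$dateofdeath), "OtherCause", "NotDied"))
toy$DeathBy <- factor(toy$DeathBy, levels = c("NotDied", "OtherCause", "Suicide"))
table(toy$DeathBy)
```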

Final Dataset


The edclean dataset contains the patient demographics, ethnic density score (main exposure variable), Suicide variable (main outcome variable), confounding variables and other variables.

The edclean dataset is the starting point for the rest of the analyses. The Data Dictionary for this dataset is provided below (see Data Dictionary).

# generating the final dataset
edclean <- edclean %>% 
  select(Gender_Cleaned, DOB_Cleaned, Marital_Cleaned, 
  diagnosisdate, ageatdiagnosis, Schizophrenia_Diag, 
  SchizoAffective_Diag,Depressive_Diag,SubAbuse_Diag, 
  Manic_Diag,Bipolar_Diag,ethnicitycleaned,ethnicity,
  ethnicdensityscore,imd_score,dateofdeath,Suicide,
  ageatdeath,agegroups,LSOA_4boroughs,LSOA11,DeathBy, 
  TotalResidentsInLSOA, WhiteBrit_EDPercent, WhiteIrish_EDPercent,
  OtherWhite_EDPercent, WhiteBlackCarib_EDPercent,
  WhiteBlackAfri_EDPercent, WhiteAsian_EDPercent,
  OtherMixed_EDPercent, BritIndian_EDPercent,
  BritPakistani_EDPercent, BritBangladeshi_EDPercent,
  BritChinese_EDPercent, OtherAsian_EDPercent,
  African_EDPercent, Caribbean_EDPercent,
  OtherBlack_EDPercent)

# this is important before doing regression analysis (as is converting all the 
# other categorical variables into factors).
edclean$Suicide <- as.factor(edclean$Suicide)

# save new feature into new dataset "edclean"
save(edclean, file = "Data_ED_new_features.Rdata", compress = TRUE)

Exploring the data

This section describes basic exploratory data analysis of the outcome (Suicide) and the exposure variable (ethnic density score), and any interactions and associations with other available variables of interest (age, gender, marital status, area-level deprivation and borough).

Suicide and its unadjusted association with the variables of interest and the main exposure are first analysed. This is followed by exploring ethnic density score and its association with relevant variables.


Exploring Death by Suicide

The following tables show the distribution of deaths by Suicide (0=No, 1=Yes) across the demographic variables. A chi-square test is also conducted to test for association between each factor and the outcome of Suicide.


Suicide and Gender

# descriptive_table_suicide_gender
x <- as.data.frame(table(edclean$Gender_Cleaned,edclean$Suicide)) %>% spread(Var2,Freq)
names(x) <- c("Gender","0","1")
xstat <- chisq.test(edclean$Suicide, edclean$Gender_Cleaned)
x$Chi <- ""
x$P <- ""
x[2,4] <- round(xstat$statistic, 3)
x[2,5] <- signif(xstat$p.value, 2)
kable(x)
| Gender |     0 |   1 | Chi   | P       |
|:-------|------:|----:|:------|:--------|
| Female | 23003 |  80 |       |         |
| Male   | 24264 | 182 | 33.57 | 6.9e-09 |

Suicide and Age Groups

a <- as.data.frame(table(edclean$agegroups, edclean$Suicide)) %>% spread(Var2, Freq)
names(a) <- c("Age Groups", "0", "1")
a$Chi <- ""
a$P <- ""
astat <- chisq.test(edclean$agegroups, edclean$Suicide)
a[4,4] <- round(astat$statistic, 3)
a[4,5] <- signif(astat$p.value, 2)
kable(a)
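| Age Groups |     0 |   1 | Chi   | P       |
|:-----------|------:|----:|:------|:--------|
| < 25       |  7705 |  21 |       |         |
| 26-40      | 15977 |  98 |       |         |
| 41-60      | 16422 | 110 |       |         |
| 61-100     |  7163 |  33 | 17.06 | 0.00069 |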

Suicide and Marital Status

b <- as.data.frame(table(edclean$Marital_Cleaned, edclean$Suicide)) %>% spread(Var2, Freq)
names(b) <- c("Marital Status", "0", "1")
b$Chi <- ""
b$P <- ""
bstat <- chisq.test(edclean$Marital_Cleaned, edclean$Suicide)
b[4,4] <- round(bstat$statistic, 3)
b[4,5] <- signif(bstat$p.value, 2)
kable(b)
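| Marital Status                 |     0 |   1 | Chi   | P    |
|:-------------------------------|------:|----:|:------|:-----|
| Divorced / Separated / Widowed |  7366 |  34 |       |      |
| Married / Cohabiting           |  8345 |  46 |       |      |
| Single                         | 27790 | 156 |       |      |
| Unknown                        |  3766 |  26 | 2.413 | 0.49 |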

Suicide and Ethnicity

c <- as.data.frame(table(edclean$ethnicity, edclean$Suicide)) %>% spread(Var2, Freq)
names(c) <- c("Ethnicity", "0", "1")
c$Chi <- ""
c$P <- ""
cstat <- chisq.test(edclean$ethnicity, edclean$Suicide)
c[8,4] <- round(cstat$statistic, 3)
c[8,5] <- signif(cstat$p.value, 2)
kable(c)
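| Ethnicity   |     0 |   1 | Chi    | P    |
|:------------|------:|----:|:-------|:-----|
| Asian       |  2536 |  11 |        |      |
| Black       |  3277 |  16 |        |      |
| Caribbean   |  2742 |  13 |        |      |
| Irish       |  1558 |   9 |        |      |
| Mixed Race  |  1293 |   8 |        |      |
| Other Black |  3739 |   9 |        |      |
| Other White |  4365 |  20 |        |      |
| White       | 27757 | 176 | 11.855 | 0.11 |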

Suicide and Borough

d <- as.data.frame(table(edclean$LSOA_4boroughs, edclean$Suicide)) %>% spread(Var2, Freq)
names(d) <- c("Borough", "0", "1")
d$Chi <- ""
d$P <- ""
dstat <- chisq.test(edclean$LSOA_4boroughs, edclean$Suicide)
d[5,4] <- round(dstat$statistic, 3)
d[5,5] <- signif(dstat$p.value, 2)
kable(d)
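| Borough   |     0 |  1 | Chi    | P       |
|:----------|------:|---:|:-------|:--------|
| CROYDON   | 10370 | 49 |        |         |
| LAMBETH   |  9934 | 45 |        |         |
| LEWSIHAM  |  8090 | 45 |        |         |
| OTHER     |  9810 | 82 |        |         |
| SOUTHWARK |  9063 | 41 | 18.684 | 0.00091 |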
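
The five tables above are all built the same way, so the repeated steps could be wrapped in a helper. The function `suicide_table()` below is a hypothetical name introduced for illustration, shown on simulated counts rather than the cohort; it uses only base R, so it does not need spread() from tidyr.

```r
# Cross-tabulate a categorical variable against the outcome and put the
# chi-square statistic and p-value in the last row, as in the tables above
suicide_table <- function(data, var, outcome = "Suicide") {
  counts <- table(data[[var]], data[[outcome]])
  test   <- chisq.test(counts)
  wide   <- as.data.frame.matrix(counts)  # one column per outcome level
  out    <- data.frame(Level = rownames(wide), wide, Chi = "", P = "",
                       check.names = FALSE, stringsAsFactors = FALSE,
                       row.names = NULL)
  out$Chi[nrow(out)] <- round(unname(test$statistic), 3)
  out$P[nrow(out)]   <- signif(test$p.value, 2)
  out
}

# Simulated toy data: 100 women (10 events) and 100 men (30 events)
toy <- data.frame(
  Gender  = rep(c("Female", "Male"), times = c(100, 100)),
  Suicide = c(rep(c(0, 1), times = c(90, 10)), rep(c(0, 1), times = c(70, 30)))
)
res <- suicide_table(toy, "Gender")
res
```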

Suicide and Deprivation

qplot(Suicide, imd_score, data = edclean, main = "Area-level Deprivation by Suicide") +
  geom_boxplot() +
  xlab("Suicide (1 = Yes, 0 = No)") +
  ylab("Area-level Deprivation Score")

# Comparing mean deprivation score by Suicides vs Non-Suicides
edclean %>% select(imd_score, Suicide) %>% group_by(Suicide) %>% summarise(Mean = mean(imd_score), S.D = sd(imd_score), N = n()) %>% kable()
| Suicide |     Mean |      S.D |     N |
|:--------|---------:|---------:|------:|
| 0       | 29.42796 | 10.87067 | 47267 |
| 1       | 28.54211 | 11.34683 |   262 |

# t-test to check for significant differences in mean deprivation score
with(edclean, t.test(imd_score[Suicide == 0], imd_score[Suicide == 1], conf.level = 0.95, paired = FALSE))
## 
##  Welch Two Sample t-test
## 
## data:  imd_score[Suicide == 0] and imd_score[Suicide == 1]
## t = 1.2605, df = 263.66, p-value = 0.2086
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4979416  2.2696497
## sample estimates:
## mean of x mean of y 
##  29.42796  28.54211

Suicide and Ethnic Density Score

qplot(Suicide, ethnicdensityscore, data = edclean, main = "Ethnic Density Scores by Suicide") +
  geom_boxplot() +
  xlab("Suicide (1 = Yes, 0 = No)") +
  ylab("Ethnic Density Score")

# Comparing mean ethnic density score by Suicides vs Non-Suicides
edclean %>% select(ethnicdensityscore, Suicide) %>% 
  group_by(Suicide) %>% 
  summarise(Mean = mean(ethnicdensityscore), S.D = sd(ethnicdensityscore), N = n()) %>% 
  kable()
| Suicide |     Mean |      S.D |     N |
|:--------|---------:|---------:|------:|
| 0       | 32.92411 | 25.90010 | 47267 |
| 1       | 38.20856 | 27.02586 |   262 |

T-test to check for significant differences in mean ethnic density score

# t-test to check for significant differences in mean ethnic density score
with(edclean, t.test(ethnicdensityscore[Suicide == 0], ethnicdensityscore[Suicide == 1], conf.level = 0.95, paired = FALSE))
## 
##  Welch Two Sample t-test
## 
## data:  ethnicdensityscore[Suicide == 0] and ethnicdensityscore[Suicide == 1]
## t = -3.157, df = 263.66, p-value = 0.001779
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.580377 -1.988521
## sample estimates:
## mean of x mean of y 
##  32.92411  38.20856

The mean ethnic density score differs significantly by Suicide. The mean ethnic density score among those who died by suicide is significantly higher than among those who did not die by suicide.


Suicide and Ethnic Density Score by Ethnicity

Because the White ethnic group makes up a large proportion of the cohort (see the Suicide and Ethnicity table) and ethnic density scores are generated using ethnicity, it is worth exploring the ethnic density distribution by suicide within each ethnic group.

Below are boxplots of the ethnic density distributions by Suicide within each ethnic group.

| ethnicity   | Suicide |      Mean |        S.D |     N |
|:------------|:--------|----------:|-----------:|------:|
| Asian       | 0       |  5.061238 |  5.7740780 |  2536 |
| Asian       | 1       |  9.663636 | 13.8316501 |    11 |
| Black       | 0       | 15.210070 |  8.9196143 |  3277 |
| Black       | 1       | 16.650000 | 10.0434390 |    16 |
| Caribbean   | 0       | 10.780671 |  4.9713853 |  2742 |
| Caribbean   | 1       |  9.715385 |  5.2925807 |    13 |
| Irish       | 0       |  2.079076 |  0.8917008 |  1558 |
| Irish       | 1       |  1.711111 |  0.4456581 |     9 |
| Mixed Race  | 0       |  2.185460 |  1.1687715 |  1293 |
| Mixed Race  | 1       |  1.937500 |  1.2861210 |     8 |
| Other Black | 0       |  4.814228 |  2.2278974 |  3739 |
| Other Black | 1       |  6.366667 |  1.7979155 |     9 |
| Other White | 0       | 12.389223 |  5.2981810 |  4365 |
| Other White | 1       | 12.112080 |  7.5515498 |    20 |
| White       | 0       | 49.927589 | 20.1629816 | 27757 |
| White       | 1       | 52.165909 | 21.2804693 |   176 |

There seem to be some differences in ethnic density means by suicide within each ethnic group. The ethnic density distribution also differs between ethnic groups in terms of range (this is explored further below when exploring ethnic density score distributions).


Below is a table comparing mean ethnic density by Suicide within each ethnic group, using a t-test within each ethnic group.

| ethnicity   | Suicide |      Mean |        S.D |     N | T-test |
|:------------|:--------|----------:|-----------:|------:|:-------|
| Asian       | 0       |  5.061238 |  5.7740780 |  2536 | t = -1.1032, df = 10.015, p-value = 0.2958 |
| Asian       | 1       |  9.663636 | 13.8316501 |    11 | |
| Black       | 0       | 15.210070 |  8.9196143 |  3277 | t = -0.57238, df = 15.116, p-value = 0.5755 |
| Black       | 1       | 16.650000 | 10.0434390 |    16 | |
| Caribbean   | 0       | 10.780671 |  4.9713853 |  2742 | t = 0.72421, df = 12.101, p-value = 0.4827 |
| Caribbean   | 1       |  9.715385 |  5.2925807 |    13 | |
| Irish       | 0       |  2.079076 |  0.8917008 |  1558 | t = 2.4488, df = 8.3743, p-value = 0.03872 |
| Irish       | 1       |  1.711111 |  0.4456581 |     9 | |
| Mixed Race  | 0       |  2.185460 |  1.1687715 |  1293 | t = 0.54392, df = 7.0717, p-value = 0.6032 |
| Mixed Race  | 1       |  1.937500 |  1.2861210 |     8 | |
| Other Black | 0       |  4.814228 |  2.2278974 |  3739 | t = -2.5856, df = 8.0592, p-value = 0.03214 |
| Other Black | 1       |  6.366667 |  1.7979155 |     9 | |
| Other White | 0       | 12.389223 |  5.2981810 |  4365 | t = 0.16394, df = 19.086, p-value = 0.8715 |
| Other White | 1       | 12.112080 |  7.5515498 |    20 | |
| White       | 0       | 49.927589 | 20.1629816 | 27757 | t = -1.3914, df = 177, p-value = 0.1658 |
| White       | 1       | 52.165909 | 21.2804693 |   176 | |
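
The code behind the per-group tests is not shown, so the sketch below illustrates one way such a table could be produced: split the data by ethnic group and run a Welch t-test in each subset. The data are simulated, the group labels are placeholders, and the numbers it prints have no relation to the cohort results.

```r
set.seed(1)  # toy data only; values are simulated, not from the cohort
toy <- data.frame(
  ethnicity          = rep(c("GroupA", "GroupB"), each = 40),
  Suicide            = rep(c(0, 1), times = 40),
  ethnicdensityscore = runif(80, min = 0, max = 100)
)

# Run a Welch two-sample t-test within each ethnic group
# and collect the statistic, degrees of freedom and p-value
by_group <- lapply(split(toy, toy$ethnicity), function(d) {
  tt <- t.test(ethnicdensityscore ~ Suicide, data = d)
  data.frame(ethnicity = d$ethnicity[1],
             t  = unname(tt$statistic),
             df = unname(tt$parameter),
             p  = tt$p.value)
})
results <- do.call(rbind, by_group)
results
```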

Comparing the ethnic density score means by Suicide within each ethnic group produces different results from comparing means in the entire cohort (t = -3.157, p-value = 0.001779). The table above shows that the mean ethnic density scores are significantly different in the Irish and Other Black ethnic groups, but there is no significant difference in the other ethnic groups.


Exploring ethnic density distribution (main exposure)


Ethnic density distribution by Ethnicity

The plot above shows the square root (to reduce noise) of the ethnic density score distribution for each ethnic group. It is clear that the ethnic minority groups have relatively narrower ethnic density ranges, and that relatively fewer patients from these groups are known to mental health services compared to the White ethnic group.


Here is a summary table comparing means by ethnic groups.

ethnicity MEAN MEDIAN SD N VARIANCE QT25 QT75
White 49.942 46.9 20.171 27933 406.869 34.700 63.000
Other White 12.388 12.0 5.309 4385 28.185 8.693 15.466
Other Black 4.818 4.9 2.228 3748 4.964 3.200 6.200
Black 15.217 13.3 8.924 3293 79.638 8.600 20.100
Caribbean 10.776 10.6 4.972 2755 24.721 7.300 14.100
Asian 5.081 3.3 5.834 2547 34.036 1.600 6.200
Irish 2.077 2.0 0.890 1567 0.792 1.500 2.600
Mixed Race 2.184 2.0 1.169 1301 1.367 1.300 2.900

The plot and tables above indicate that the ethnic density distribution differs by ethnicity.

The non-White British ethnic groups have an ethnic density distribution with a much narrower range than the White British ethnic density distribution. This potentially reflects under-representation of ethnic minority groups in mental health services in South East London.

The Irish, Mixed Race and Other Black ethnic groups have a limited range of ethnic density scores (all below 12%).

There seem to be different levels of ethnic density “exposure” (depending on ethnicity). Whether these levels are representative of the ethnic density distributions for South East London cannot be determined.

The table below shows results from the ANOVA test conducted to determine whether the differences in means are significant. The results show that mean ethnic density scores differ significantly by ethnicity.

Degrees of Freedom F value p-value
ethnicity 7 11372 <0.001
Residuals 47521
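As a rough cross-check, a one-way ANOVA F statistic can be rebuilt from the per-group means, SDs and counts in the summary table above. This is a sketch under the assumption that the reported ANOVA is a plain one-way model of ethnicdensityscore on ethnicity. The data frame grp simply re-enters the rounded table values, so small rounding differences are expected.

```r
# One-way ANOVA F statistic from group summaries (rounded table values).
grp <- data.frame(
  ethnicity = c("White", "Other White", "Other Black", "Black",
                "Caribbean", "Asian", "Irish", "Mixed Race"),
  mean = c(49.942, 12.388, 4.818, 15.217, 10.776, 5.081, 2.077, 2.184),
  sd   = c(20.171, 5.309, 2.228, 8.924, 4.972, 5.834, 0.890, 1.169),
  n    = c(27933, 4385, 3748, 3293, 2755, 2547, 1567, 1301)
)

N <- sum(grp$n)                      # total number of patients
k <- nrow(grp)                       # number of ethnic groups
grand_mean <- sum(grp$n * grp$mean) / N

ss_between <- sum(grp$n * (grp$mean - grand_mean)^2)
ss_within  <- sum((grp$n - 1) * grp$sd^2)

df_between <- k - 1
df_within  <- N - k
f_value <- (ss_between / df_between) / (ss_within / df_within)

print(c(df_between = df_between, df_within = df_within, F = f_value))
```

The degrees of freedom (7 and 47521) match the table, and the F value lands close to the reported 11372. The large F is driven mostly by the gap between the White group mean and all other group means.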

Ethnic density distribution by Borough

The White ethnic group is the most represented group across the boroughs, as shown in the barplot below.

It also has the highest ethnic density scores across boroughs compared to the other ethnic groups, as shown in the table below.

The mean Ethnic Density Score by Ethnicity and Borough

LSOA_4boroughs Asian Black Caribbean Irish Mixed Race Other Black Other White White N
CROYDON 6.874 11.172 12.757 1.554 2.380 4.937 7.502 49.160 10419
LAMBETH 2.008 14.194 10.887 2.488 2.436 5.455 16.279 39.526 9979
LEWSIHAM 3.606 14.134 12.357 1.910 2.587 4.909 10.577 41.840 8135
OTHER 7.195 10.352 4.296 1.803 1.270 2.155 11.503 67.691 9892
SOUTHWARK 2.367 21.170 8.435 2.266 1.916 5.026 11.751 41.263 9104

In the plot below, across all boroughs the ethnic minority groups all have mean scores below 50%, while the White ethnic density scores show the full range of scores (0 - 100%; mean scores range from ~40% to ~68%).

The ANOVA shows that the differences in mean scores by borough and ethnicity are statistically significant.

lm_edscore_by_borough_and_ethnicity <- lm(ethnicdensityscore ~ as.factor(LSOA_4boroughs)*as.factor(ethnicity) , data = edclean)
anova(lm_edscore_by_borough_and_ethnicity)
## Analysis of Variance Table
## 
## Response: ethnicdensityscore
##                                                   Df   Sum Sq Mean Sq
## as.factor(LSOA_4boroughs)                          4  5294953 1323738
## as.factor(ethnicity)                               7 16970520 2424360
## as.factor(LSOA_4boroughs):as.factor(ethnicity)    28  1327188   47400
## Residuals                                      47489  8311998     175
##                                                 F value    Pr(>F)    
## as.factor(LSOA_4boroughs)                       7562.92 < 2.2e-16 ***
## as.factor(ethnicity)                           13851.11 < 2.2e-16 ***
## as.factor(LSOA_4boroughs):as.factor(ethnicity)   270.81 < 2.2e-16 ***
## Residuals                                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary: Exploring Ethnic Density

There may be a case of under-representation of non-White British ethnic groups in this cohort. In order to answer our question “Does ethnic density predict suicide?”, analysing ethnic minority groups may not provide an unbiased answer. The final analysis could be conducted separately for each ethnic group or just within the White ethnic group (see table below for counts by Suicide and ethnic group).

 |Ethnicity   | Sui:No| Sui:Yes|Chi    |P    |
 |:-----------|-----: |---:    |:------|:----|
 |Asian       |  2536 |  11    |       |     |
 |Black       |  3277 |  16    |       |     |
 |Caribbean   |  2742 |  13    |       |     |
 |Irish       |  1558 |   9    |       |     |
 |Mixed Race  |  1293 |   8    |       |     |
 |Other Black |  3739 |   9    |       |     |
 |Other White |  4365 |  20    |       |     |
 |White       | 27757 | 176    |11.855 |0.11 |

Further Exploration

Before starting the analysis, another part of the research related to ethnic density is introduced here. This part will undergo the same process of feature engineering and will also be explored and analysed later.

Introduction to Trust Ethnic Density and how does it differ from the original ethnic density score measure

On discussion with my mentor, we came up with another potential research question: “How does each patient’s ethnic density in the community (population ethnic density) compare to their ethnic density within the Trust (trust ethnic density)?” In other words, can we predict ethnic density within the trust from patients’ population ethnic density?

Defining Trust Ethnic Density

Like the original ethnic density score, which is defined as the composition of each ethnic group residing in a geographical area of a given size, the trust ethnic density is the percentage composition of each ethnic group in a given group of patients residing in the same LSOA code and who have been referred to the trust. Comparing the original ethnic density score (which will be referred to as the population ethnic density score) to the trust ethnic density score, can give an idea of whether being referred to mental health services can be explained in part by one’s population ethnic density.

What follows below is code for additional feature engineering and exploration of trust ethnic density (outcome) in the context of ethnic density (main exposure).


Additional Feature Engineering

Three new variables were generated - “LSOAsize”, “trust.ed” and “ratio”

Description of the new additional variables

|Variable |Description |
|:--------|:-----------|
|LSOAsize |The total number of patients within a given LSOA code |
|trust.ed |The percentage of patients within a given LSOA code who share the patient's ethnicity (the trust ethnic density) |
|ratio    |trust.ed divided by the population ethnic density score variable (ethnicdensityscore). This ratio shows whether an individual's ethnic density within the Trust is proportionate to their ethnic density in the community |

load("Data_ED_new_features.Rdata")

LSOAethnicdensity <- edclean %>% 
  dplyr::select(ethnicdensityscore, ethnicity, imd_score,  # select is exported, so :: is sufficient
                 ageatdiagnosis, LSOA11, ethnicitycleaned, 
                 LSOA_4boroughs, Suicide, DeathBy, Gender_Cleaned, 
                 Marital_Cleaned, WhiteBrit_EDPercent, OtherWhite_EDPercent, 
                 African_EDPercent) %>% group_by(LSOA11, ethnicity) %>% 
  mutate(ethcount = length(ethnicity)) %>% 
  group_by(LSOA11) %>% 
  mutate(LSOAsize = n(),
         trust.ed = ((ethcount/LSOAsize)*100),  
         ratio = trust.ed/ethnicdensityscore) %>% 
  ungroup() %>% 
  distinct() %>% 
  mutate(Gender_Cleaned=factor(Gender_Cleaned, levels=c("Male","Female")),
         Marital_Cleaned=factor(Marital_Cleaned,
                                levels=c("Unknown","Single","Married / Cohabiting","Divorced / Separated / Widowed")),
         LSOA_4boroughs=factor(LSOA_4boroughs, 
                               levels=c("OTHER","CROYDON","SOUTHWARK","LEWSIHAM","LAMBETH"))) %>%
         mutate(Suicide = ifelse(Suicide == 0, "No", "Yes"))
LSOAethnicdensity$Suicide <- as.factor(LSOAethnicdensity$Suicide)

save(LSOAethnicdensity, file="LSOAethnicdensity.Rdata")

A brief look at the data to demonstrate how “ratio” links “ethnicdensityscore” and “trust.ed”.

ethnicity LSOAsize ethnicdensityscore trust.ed ratio
White 3 60.400000 66.66667 1.103753
Other White 3 15.230312 33.33333 2.188618
White 3 60.400000 66.66667 1.103753
Black 1 8.900000 100.00000 11.235955
White 2 15.400000 100.00000 6.493506
White 2 15.400000 100.00000 6.493506
White 1 54.300000 100.00000 1.841621
Asian 1 7.600000 100.00000 13.157895
White 1 54.800000 100.00000 1.824817
White 1 44.400000 100.00000 2.252252
White 2 48.100000 50.00000 1.039501
Black 2 16.400000 50.00000 3.048780
Other White 1 7.147297 100.00000 13.991304
White 1 50.200000 100.00000 1.992032
White 1 50.800000 100.00000 1.968504

Correlation of Trust ED and Population ED

The graph below shows a strong positive correlation between overall Trust Ethnic Density (trust.ed) and Population Ethnic Density (ethnicdensityscore). The correlation test follows the plot. The correlation by ethnicity is different as can be seen by the difference in coloured dots (representing different ethnic groups) in the plot below:

cor.test(LSOAethnicdensity$ethnicdensityscore, LSOAethnicdensity$trust.ed)
## 
##  Pearson's product-moment correlation
## 
## data:  LSOAethnicdensity$ethnicdensityscore and LSOAethnicdensity$trust.ed
## t = 338.42, df = 45849, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8424203 0.8476540
## sample estimates:
##       cor 
## 0.8450574

Trust ethnic density versus Population Ethnic Density Facetted by Ethnicity (see below for corresponding correlations)

The plots below show that the correlation between trust ethnic density and population ethnic density differs by ethnic group.

Within the White British ethnic group, there is a clear positive correlation between patients’ trust ethnic density scores and community ethnic density scores.

For the rest of the ethnic groups, the correlation is weak or almost absent (for the Irish, Mixed Race and Other Black ethnic groups this could be because of the restricted range of ethnic density scores).

Table of correlations

Pearson’s correlation value, p-value
Asian 0.35, < 2.2e-16
Black -0.045, 0.009
Caribbean -0.046, 0.01523
Irish 0.01658627, 0.5127
Mixed Race -0.2418812, < 2.2e-16
Other Black -0.1521063, < 2.2e-16
Other White 0.176316, < 2.2e-16
White 0.76, < 2.2e-16

Ratio versus Population Ethnic Density

The ratio variable can be plotted against the population ethnic density scores (ethnicdensityscore) to show how much the trust ethnic density score varies with the population ED.

Interpreting the Ratio variable in the plot above:

  - The horizontal (red) line of 1 represents an optimal representation of 
    Ethnic Density (ED) in both the Trust and in the Community. That is 
    to say, if a patient had a 50% ethnic density in the community, they 
    also have a 50% or approximately 50% ethnic density within the Trust.
    
  - The closer the RATIO value is to 1 the more equal the ratio of Trust 
    Ethnic Density to Population Ethnic Density is. 
    
  - The yellow trend line is drawn with geom_smooth, which uses a generalised 
    additive model (gam) with integrated smoothness estimation.
    

From the plot, there seems to be a recurring pattern in all ethnic groups. While most of the patients in each ethnic group have a proportionate representation (i.e. “ratio” is 1 or close to 1), patients (regardless of ethnic group) living in areas of less than 5% ethnic density (i.e. fewer than 5% of the residents in their area are of their own ethnic group) tend to have a very high trust ED to community ED ratio (i.e. they are the only people from their LSOA code to be represented in, or known to, mental health services).


EDA: Take home points

There are some points to consider before analysis:

  • Ethnic Density differs by Ethnicity

Ethnic density distributions vary by ethnicity, with all the ethnic minority groups living in areas where there are less than 50% of their ethnic group. This could be the true ethnic density range of these ethnic minority groups.

The White ethnic group however is “exposed” to the full range of the ethnic density distribution (i.e. 0% - 100%) and hence I examine completed suicides in this group only from here on.

  • Ethnic density differs by boroughs

The distribution of ethnic density differs by boroughs in each ethnic group.

  • Suicide and Demographics

Suicide is extremely rare and hence the data here are highly unbalanced. Final analysis will have to take this into consideration.

Borough, age groups and gender have some association with death by suicide.

  • Population Ethnic Density seems to predict Trust Ethnic Density

The exploratory analysis revealed a strong positive correlation between population ethnic density and trust ethnic density. A second piece of analysis can be conducted to investigate this association.

With this in mind, the analysis will be conducted in two parts:

The first analysis will aim to answer: Can completed suicides be predicted by the ethnic density scores?

  • Outcome: Death By Suicide - Suicide coded as 0 (Not Died by Suicide) or 1 (Died by Suicide)
  • Exposure: Ethnic Density Score - ethnicdensityscore
  • Other Variables: age, gender, marital status, deprivation score and borough.

The second analysis will answer: Can we predict trust/sample ethnic density using population ethnic density scores?

  • Outcome: Trust Ethnic Density - trust.ed (scores)
  • Exposure: Ethnic Density Score - ethnicdensityscore (scores)
  • Other Variables: age, gender, marital status, deprivation score and borough.

For the first analysis, the association between ethnic density and completed suicide will be investigated in the White ethnic group. The White British ethnic group forms a large proportion of the entire cohort and is the most represented across boroughs. Exploring ethnic density and suicide in this group will allow the “full effect” of ethnic density on deaths by suicide in a psychiatric healthcare setting to be examined.

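For the second analysis, the association between population ethnic density and trust ethnic density will be investigated. This analysis was initially run on the entire cohort but, as noted in Part 2, the final models are restricted to the White ethnic group in LSOAs with enough patients.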


Data Analysis: Part 1


Ethnic Density and Suicide: Logistic regression

Since Suicide is a binary outcome (1 or 0), a logistic regression will be conducted to predict deaths by suicide from the patient's ethnic density score.

Model 1: Unadjusted analysis - Suicide and Ethnic density score only

## 
## Call:
## glm(formula = Suicide ~ ethnicdensityscore, family = "binomial", 
##     data = dataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.1279  -0.1160  -0.1111  -0.1076   3.2424  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -5.335226   0.205679 -25.940   <2e-16 ***
## ethnicdensityscore  0.005378   0.003668   1.466    0.143    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2134.5  on 27932  degrees of freedom
## Residual deviance: 2132.4  on 27931  degrees of freedom
## AIC: 2136.4
## 
## Number of Fisher Scoring iterations: 8

Here are the exponentiated coefficients (odds ratios) and the ANOVA of the model to test for significant differences.

exp(logistic_model_base$coefficients)

#       (Intercept) ethnicdensityscore 
#       0.004818821        1.005392132 

## anova
anova(logistic_model_base, test="Chisq") 

##                    Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                               27932     2134.5         
## ethnicdensityscore  1   2.1205     27931     2132.4   0.1453
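To put the exponentiated coefficient in context, the sketch below converts the printed estimate and standard error for ethnicdensityscore into an odds ratio with an approximate 95% Wald confidence interval. The inputs are the rounded values from the model summary above. The Wald interval is an approximation and may differ slightly from a profile-likelihood interval, and the 10-point increase is only an illustrative choice.

```r
# Odds ratio and approximate 95% Wald interval for ethnicdensityscore,
# using the rounded estimate and standard error from the Model 1 summary.
est <- 0.005378
se  <- 0.003668

odds_ratio <- exp(est)                                  # per 1-point increase
wald_ci    <- exp(est + c(-1, 1) * qnorm(0.975) * se)   # lower, upper
or_per_10  <- exp(10 * est)                             # per 10-point increase

print(round(c(OR = odds_ratio, lower = wald_ci[1], upper = wald_ci[2],
              OR_per_10 = or_per_10), 4))
```

The interval spans 1, which agrees with the non-significant p-values in the summary and the analysis of deviance table.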

Model 2: Suicide and Ethnic density, with the other variables of interest

Results from the logistic regression and corresponding anova test.

## 
## Call:
## glm(formula = Suicide ~ Gender_Cleaned + ageatdiagnosis + Marital_Cleaned + 
##     imd_score + LSOA_4boroughs + ethnicdensityscore, family = "binomial", 
##     data = dataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.2051  -0.1236  -0.1095  -0.0866   3.5356  
## 
## Coefficients:
##                                                Estimate Std. Error z value
## (Intercept)                                   -3.989637   0.585128  -6.818
## Gender_CleanedFemale                          -0.665341   0.165314  -4.025
## ageatdiagnosis                                 0.005972   0.004751   1.257
## Marital_CleanedSingle                         -0.169396   0.256716  -0.660
## Marital_CleanedMarried / Cohabiting           -0.337914   0.306962  -1.101
## Marital_CleanedDivorced / Separated / Widowed -0.522282   0.333981  -1.564
## imd_score                                     -0.009147   0.008106  -1.129
## LSOA_4boroughsCROYDON                         -0.665546   0.236345  -2.816
## LSOA_4boroughsSOUTHWARK                       -0.537767   0.254998  -2.109
## LSOA_4boroughsLEWSIHAM                        -0.459912   0.257400  -1.787
## LSOA_4boroughsLAMBETH                         -0.635262   0.269999  -2.353
## ethnicdensityscore                            -0.003931   0.005165  -0.761
##                                               Pr(>|z|)    
## (Intercept)                                   9.21e-12 ***
## Gender_CleanedFemale                          5.70e-05 ***
## ageatdiagnosis                                 0.20882    
## Marital_CleanedSingle                          0.50935    
## Marital_CleanedMarried / Cohabiting            0.27097    
## Marital_CleanedDivorced / Separated / Widowed  0.11786    
## imd_score                                      0.25910    
## LSOA_4boroughsCROYDON                          0.00486 ** 
## LSOA_4boroughsSOUTHWARK                        0.03495 *  
## LSOA_4boroughsLEWSIHAM                         0.07398 .  
## LSOA_4boroughsLAMBETH                          0.01863 *  
## ethnicdensityscore                             0.44667    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2134.5  on 27932  degrees of freedom
## Residual deviance: 2098.4  on 27921  degrees of freedom
## AIC: 2122.4
## 
## Number of Fisher Scoring iterations: 8

The exponential value of the estimate for ethnic density is (using the exp(logistic_model$coefficients) function) ~1 (0.996).

## anova
anova(logistic_model, test="Chisq") 
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Suicide
## 
## Terms added sequentially (first to last)
## 
## 
##                    Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                               27932     2134.5              
## Gender_Cleaned      1  20.0083     27931     2114.5 7.711e-06 ***
## ageatdiagnosis      1   0.1469     27930     2114.3   0.70153    
## Marital_Cleaned     3   2.9622     27927     2111.4   0.39749    
## imd_score           1   2.0390     27926     2109.3   0.15331    
## LSOA_4boroughs      4  10.3269     27922     2099.0   0.03527 *  
## ethnicdensityscore  1   0.5770     27921     2098.4   0.44749    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Comparison of the base model (Model 1) and full model (Model 2)

# ANOVA
anova(logistic_model_base, logistic_model, test ="Chisq")
## Analysis of Deviance Table
## 
## Model 1: Suicide ~ ethnicdensityscore
## Model 2: Suicide ~ Gender_Cleaned + ageatdiagnosis + Marital_Cleaned + 
##     imd_score + LSOA_4boroughs + ethnicdensityscore
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     27931     2132.4                          
## 2     27921     2098.4 10    33.94 0.0001891 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is a significant difference between both models. The full model is possibly the better model as it adjusts or takes into consideration other potential confounding variables.


Selecting the better model

The pseudo R^2 measure is defined below; it allows models to be compared in the way “R^2” is used in linear regression, but for logistic regression analysis.

    "Unlike linear regression with ordinary least squares estimation, there is no 
    R2 statistic which explains the proportion of variance in the dependent variable
    that is explained by the predictors. However, there are a number of pseudo R2 
    metrics that could be of value. Most notable is McFadden’s R2, which is defined as 
    1−[ln(LM)/ln(L0)] where ln(LM) is the log likelihood value for the fitted model 
    and ln(L0) is the log likelihood for the null model with only an intercept as a 
    predictor. The measure ranges from 0 to just under 1, with values closer to zero 
    indicating that the model has no predictive power."

Results from the Pseudo R^2 test

Base_Model_pseudo_R2 Full_Model_pseudo_R2
McFadden 0.0009934 0.016894

The table shows that both models have little predictive power, as both pseudo R^2 values are close to 0. The fully adjusted model performs better than the base model (0.017 versus 0.001, respectively).
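The McFadden values in the table (0.0009934 and 0.016894) can be approximately recovered from the deviances already printed above. The sketch assumes an ungrouped binary (0/1) outcome, for which the residual deviance equals -2 times the log likelihood, so McFadden's R^2 reduces to the explained deviance divided by the null deviance. The function name mcfadden_from_deviance is hypothetical.

```r
# McFadden's pseudo R^2 from deviances, valid for an ungrouped 0/1 outcome
# where deviance = -2 * log-likelihood. Inputs are rounded printed values.
mcfadden_from_deviance <- function(null_deviance, explained_deviance) {
  explained_deviance / null_deviance
}

null_deviance <- 2134.5

# Base model: deviance explained by ethnicdensityscore alone
base_explained <- 2.1205

# Full model: sum of the sequential deviances from its anova table
full_explained <- sum(c(20.0083, 0.1469, 2.9622, 2.0390, 10.3269, 0.5770))

r2_base <- mcfadden_from_deviance(null_deviance, base_explained)
r2_full <- mcfadden_from_deviance(null_deviance, full_explained)

print(signif(c(base = r2_base, full = r2_full), 4))
```

Both values are far below the levels that would indicate useful predictive power. The covariates in the full model explain under 2% of the null deviance.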


Conclusion

Given all other predictor variables, ethnic density is not associated with deaths by suicide.




Predictive Modelling


Even though the logistic regression above suggests that ethnic density is a weak predictor of death by suicide in mental health, there is a glaring issue with this analysis: it does not take into account that suicide is a rare event (262 suicides out of ~47K observations in the full cohort; 176 out of 27,933 in the White ethnic group analysed here). The data are unbalanced. Perhaps ethnic density could predict deaths by suicide better if the data were balanced.

In addition, although the EDA strongly suggests that there is no association between ethnic density and suicide, a classification model was built as an exercise in predictive modelling and to formally assess the ability of the ethnic density score and the other variables to classify completed suicides. A generalised linear model was fitted using the R package caret. SMOTE is used to balance the data (http://search.r-project.org/library/performanceEstimation/html/smote.html). Model performance was assessed using area under the curve, sensitivity and specificity.

Model Building in Caret

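The code below uses two functions from the caret package: trainControl, which defines how the model is trained (5-fold cross-validation repeated 20 times, with SMOTE sampling to balance the classes), and train, which fits the logistic regression (glm) and evaluates it using the ROC metric.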

Here is the code that defines how to train the model

# small numbers in Suicide == YES class so not splitting into train and test
# resampling approach used instead

# trainControl: set training sampling and tuning parameters
# k-fold cv: 5 fold, repeated 20 times = 100 sample sets
# data not balanced so using SMOTE
control_smote_2class <- trainControl(method = "repeatedcv", 
                     number = 5, 
                     repeats = 20, 
                     sampling = "smote",
                     summaryFunction = twoClassSummary,
                     returnResamp="all",
                     classProbs = TRUE,
                     savePredictions = "all",
                     returnData=TRUE)

Here is the code that builds a predictive model

# building the model: glm, binomial, select on best metric using "ROC" curve
mod_fit_smote_suicide <- train(as.factor(Suicide) ~ ethnicdensityscore + Gender_Cleaned + ageatdiagnosis + Marital_Cleaned + imd_score + LSOA_4boroughs,  
                 data = edclean.vars.white, 
                 method = "glm", 
                 family="binomial", 
                 trControl = control_smote_2class, 
                 #tuneLength = 5,
                 metric = "ROC")

Performance of the model in Predicting Suicides

# the predictive summary can be given by printing the below code.
mod_fit_smote_suicide
## Generalized Linear Model 
## 
## 27933 samples
##     6 predictor
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 20 times) 
## Summary of sample sizes: 22347, 22345, 22347, 22347, 22346, 22346, ... 
## Addtional sampling using SMOTE
## 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.5837908  0.7661935  0.3377857
## 
## 
# summary of model
# summary(mod_fit_smote_suicide)
# As in the unbalanced analysis, the analysis using the balanced model also tells us that ethnic density scores are not predictive of completed suicides, given the predictors. Gender, age and borough are associated with death by suicide. Compared to males, females are protected against suicide. With every unit increase in age, the risk of dying by suicide increases. Compared to the "OTHER" borough, all other boroughs are at lower risk of death by suicide. 

ROC Curve

# roc curve
plot(rocCurve, legacy.axes=TRUE)

## 
## Call:
## roc.default(response = edclean.vars.white$Suicide, predictor = pred[,     "Yes"])
## 
## Data: pred[, "Yes"] in 27757 controls (edclean.vars.white$Suicide No) < 176 cases (edclean.vars.white$Suicide Yes).
## Area under the curve: 0.6212

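In terms of how well the model predicts suicide, the cross-validated AUC is 0.58 (0.62 when the ROC curve is computed from predictions on the full data set, as in the output above). This means the predictive value of the model is poor. Note that twoClassSummary treats the first factor level (“No”) as the event of interest, so the resampled Sens (0.77) and Spec (0.34) above refer to the “No” class. The plot above shows how poorly the model predicts completed suicides.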

Results from the predict function

#Building a confusion matrix
pred <- predict(mod_fit_smote_suicide)
confusionMatrix(pred, reference=edclean.vars.white$Suicide, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  21192   111
##        Yes  6565    65
##                                         
##                Accuracy : 0.761         
##                  95% CI : (0.756, 0.766)
##     No Information Rate : 0.9937        
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.0069        
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.369318      
##             Specificity : 0.763483      
##          Pos Pred Value : 0.009804      
##          Neg Pred Value : 0.994789      
##              Prevalence : 0.006301      
##          Detection Rate : 0.002327      
##    Detection Prevalence : 0.237354      
##       Balanced Accuracy : 0.566401      
##                                         
##        'Positive' Class : Yes           
## 

The confusion matrix shows how many times the model has correctly predicted suicide and provides other performance metrics.
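The headline metrics follow directly from the four cells of the matrix. The short sketch below recomputes them by hand with “Yes” as the positive class, using only the counts printed above.

```r
# Cell counts from the confusion matrix above, with "Yes" as the positive class
tp <- 65      # predicted Yes, reference Yes
fn <- 111     # predicted No,  reference Yes
fp <- 6565    # predicted Yes, reference No
tn <- 21192   # predicted No,  reference No

sensitivity <- tp / (tp + fn)                   # share of suicides detected
specificity <- tn / (tn + fp)                   # share of non-suicides correctly cleared
ppv         <- tp / (tp + fp)                   # positive predictive value
npv         <- tn / (tn + fn)                   # negative predictive value
accuracy    <- (tp + tn) / (tp + fn + fp + tn)
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(sensitivity = sensitivity, specificity = specificity,
              ppv = ppv, npv = npv, accuracy = accuracy,
              balanced_accuracy = balanced_accuracy), 6))
```

The positive predictive value is below 1%: of the 6,630 patients the model flags, only 65 died by suicide. This reflects how hard it is to predict an outcome with a prevalence of about 0.6%.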

Summary


Conclusion: Part 1 of Data Analysis

The results suggest little clinical utility of the model (and ethnic density) for suicide prediction.


Data Analysis: Part 2


Population Ethnic Density and Trust Ethnic Density: Linear regression

This second part will answer: Can we predict trust/sample ethnic density and ratio by population ethnic density scores?

Note: this analysis will be conducted among the White ethnic group and where “LSOAsize” (for definition see table “Description of the new additional variables”) is above 19. I had conducted the analysis on the entire cohort initially, but this introduced patterns in the residual plots that disappear when LSOAsize is above 10 and when analysing by ethnic group.


Plots of age, deprivation, trust ethnic density and population ethnic density score

The plot above shows the correlations between “trust.ed”, “ethnicdensityscore”, “imd_score” and “ageatdiagnosis”. There is no correlation between age and the other variables. There is a positive correlation between “trust.ed” and “ethnicdensityscore”. There is a negative correlation between deprivation scores (“imd_score”) and both “trust.ed” and “ethnicdensityscore”.


Plotting Interaction Trees

par(mfrow = c(1,1))
library(tree)
model <- tree(trust.ed ~ ethnicdensityscore + ageatdiagnosis + Gender_Cleaned + LSOA_4boroughs + Marital_Cleaned + imd_score, data = subset(LSOAethnicdensity, LSOAsize > 19 & ethnicity == "White"))
plot(model)
text(model)

The interaction tree shows that borough could potentially be interacting with ethnic density score.


Model 1: Trust ethnic density (outcome)

From the EDA, the tree plot and the pairs plot above, the following model was built, which includes interactions between population ethnic density score, borough and deprivation score.

“Linear regression: trust.ed ~ ethnicdensityscore*LSOA_4boroughs*imd_score + ageatdiagnosis + Gender_Cleaned + Marital_Cleaned”

# Model 1
# Full model: Trust Ethnic Density and all predictors
# Linear regression: trust.ed ~ ethnicdensityscore*LSOA_4boroughs*imd_score + ageatdiagnosis + Gender_Cleaned + Marital_Cleaned
# The tree model suggests interactions between ethnic density score and borough. 
# The literature suggests a negative correlation of ethnic density and deprivation in the White or host ethnic group. 

linear.model.age.gender.demog <- lm(trust.ed ~ ethnicdensityscore*LSOA_4boroughs*imd_score + ageatdiagnosis + Gender_Cleaned + Marital_Cleaned, data = subset(LSOAethnicdensity, LSOAsize > 19 & ethnicity == "White"))

Summarising results from Model 1

#Summarising full model
summary(linear.model.age.gender.demog) 
## 
## Call:
## lm(formula = trust.ed ~ ethnicdensityscore * LSOA_4boroughs * 
##     imd_score + ageatdiagnosis + Gender_Cleaned + Marital_Cleaned, 
##     data = subset(LSOAethnicdensity, LSOAsize > 19 & ethnicity == 
##         "White"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.634  -4.918   0.072   4.906  22.952 
## 
## Coefficients:
##                                                        Estimate Std. Error
## (Intercept)                                           63.701313   3.052941
## ethnicdensityscore                                     0.142422   0.048258
## LSOA_4boroughsCROYDON                                -29.557386   3.154573
## LSOA_4boroughsSOUTHWARK                              -30.950198   3.515002
## LSOA_4boroughsLEWSIHAM                               -26.030816   3.595451
## LSOA_4boroughsLAMBETH                                -23.909759   3.496290
## imd_score                                             -0.479064   0.075525
## ageatdiagnosis                                         0.003108   0.003357
## Gender_CleanedFemale                                   0.059806   0.112299
## Marital_CleanedSingle                                 -0.185420   0.210040
## Marital_CleanedMarried / Cohabiting                    0.233335   0.239439
## Marital_CleanedDivorced / Separated / Widowed          0.063023   0.246566
## ethnicdensityscore:LSOA_4boroughsCROYDON               0.471593   0.049984
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK             0.240070   0.058340
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM              0.283859   0.063514
## ethnicdensityscore:LSOA_4boroughsLAMBETH               0.103247   0.060114
## ethnicdensityscore:imd_score                           0.011365   0.001280
## LSOA_4boroughsCROYDON:imd_score                        0.422401   0.080457
## LSOA_4boroughsSOUTHWARK:imd_score                      0.356883   0.089949
## LSOA_4boroughsLEWSIHAM:imd_score                       0.233026   0.094302
## LSOA_4boroughsLAMBETH:imd_score                        0.146249   0.088120
## ethnicdensityscore:LSOA_4boroughsCROYDON:imd_score    -0.007982   0.001366
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK:imd_score  -0.001371   0.001629
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM:imd_score   -0.003737   0.001823
## ethnicdensityscore:LSOA_4boroughsLAMBETH:imd_score    -0.004761   0.001667
##                                                      t value Pr(>|t|)    
## (Intercept)                                           20.866  < 2e-16 ***
## ethnicdensityscore                                     2.951  0.00317 ** 
## LSOA_4boroughsCROYDON                                 -9.370  < 2e-16 ***
## LSOA_4boroughsSOUTHWARK                               -8.805  < 2e-16 ***
## LSOA_4boroughsLEWSIHAM                                -7.240 4.65e-13 ***
## LSOA_4boroughsLAMBETH                                 -6.839 8.22e-12 ***
## imd_score                                             -6.343 2.30e-10 ***
## ageatdiagnosis                                         0.926  0.35444    
## Gender_CleanedFemale                                   0.533  0.59434    
## Marital_CleanedSingle                                 -0.883  0.37736    
## Marital_CleanedMarried / Cohabiting                    0.975  0.32982    
## Marital_CleanedDivorced / Separated / Widowed          0.256  0.79826    
## ethnicdensityscore:LSOA_4boroughsCROYDON               9.435  < 2e-16 ***
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK             4.115 3.89e-05 ***
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM              4.469 7.89e-06 ***
## ethnicdensityscore:LSOA_4boroughsLAMBETH               1.718  0.08590 .  
## ethnicdensityscore:imd_score                           8.876  < 2e-16 ***
## LSOA_4boroughsCROYDON:imd_score                        5.250 1.54e-07 ***
## LSOA_4boroughsSOUTHWARK:imd_score                      3.968 7.28e-05 ***
## LSOA_4boroughsLEWSIHAM:imd_score                       2.471  0.01348 *  
## LSOA_4boroughsLAMBETH:imd_score                        1.660  0.09700 .  
## ethnicdensityscore:LSOA_4boroughsCROYDON:imd_score    -5.842 5.23e-09 ***
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK:imd_score  -0.842  0.40003    
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM:imd_score   -2.050  0.04038 *  
## ethnicdensityscore:LSOA_4boroughsLAMBETH:imd_score    -2.857  0.00429 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.924 on 20256 degrees of freedom
## Multiple R-squared:  0.7104, Adjusted R-squared:   0.71 
## F-statistic:  2070 on 24 and 20256 DF,  p-value: < 2.2e-16
# Multiple R-squared:  0.7104,  Adjusted R-squared:   0.71
# non significant terms are ageatdiagnosis, gender and marital status. 
anova(linear.model.age.gender.demog)
## Analysis of Variance Table
## 
## Response: trust.ed
##                                                Df  Sum Sq Mean Sq
## ethnicdensityscore                              1 2453957 2453957
## LSOA_4boroughs                                  4  586077  146519
## imd_score                                       1   26360   26360
## ageatdiagnosis                                  1     470     470
## Gender_Cleaned                                  1     107     107
## Marital_Cleaned                                 3     674     225
## ethnicdensityscore:LSOA_4boroughs               4   18011    4503
## ethnicdensityscore:imd_score                    1   16569   16569
## LSOA_4boroughs:imd_score                        4   13068    3267
## ethnicdensityscore:LSOA_4boroughs:imd_score     4    4034    1009
## Residuals                                   20256 1271824      63
##                                                F value    Pr(>F)    
## ethnicdensityscore                          39083.5322 < 2.2e-16 ***
## LSOA_4boroughs                               2333.5716 < 2.2e-16 ***
## imd_score                                     419.8263 < 2.2e-16 ***
## ageatdiagnosis                                  7.4909  0.006206 ** 
## Gender_Cleaned                                  1.7045  0.191712    
## Marital_Cleaned                                 3.5762  0.013303 *  
## ethnicdensityscore:LSOA_4boroughs              71.7147 < 2.2e-16 ***
## ethnicdensityscore:imd_score                  263.8906 < 2.2e-16 ***
## LSOA_4boroughs:imd_score                       52.0321 < 2.2e-16 ***
## ethnicdensityscore:LSOA_4boroughs:imd_score    16.0624 3.884e-13 ***
## Residuals                                                           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

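The terms that are non-significant in the coefficient table (age at diagnosis, gender and marital status) are removed from the model one at a time, and each reduced model is assessed using R^2.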

# Model 2
linear.model.age.gender.demog.rm.age <- update(linear.model.age.gender.demog, ~. -ageatdiagnosis)
summary(linear.model.age.gender.demog.rm.age) 
#Multiple R-squared: 0.7104,    Adjusted R-squared:   0.71 


# Model 3
linear.model.age.gender.demog.rm.age.gender <- update(linear.model.age.gender.demog.rm.age, ~. -Gender_Cleaned)
summary(linear.model.age.gender.demog.rm.age.gender) 
#Multiple R-squared:  0.7104,   Adjusted R-squared:   0.71 


# Model 4
linear.model.age.gender.demog.rm.age.gender.marital <- update(linear.model.age.gender.demog.rm.age.gender, ~. -Marital_Cleaned)
summary(linear.model.age.gender.demog.rm.age.gender.marital) 
#Multiple R-squared:  0.7102,   Adjusted R-squared:  0.7099 

All models perform roughly the same. Model 1 and Model 4 will be selected to check diagnostics.
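
Because R-squared can never increase when a term is dropped, a formal comparison of nested models can complement it. The following is a minimal sketch on simulated data, not the analysis actually run here: the data frame `sim` and the models `full` and `reduced` are hypothetical stand-ins, and only the variable names are borrowed from the models above. It uses a partial F-test via anova() plus adjusted R-squared and AIC.

```r
# Simulated stand-in data; none of these values come from the study.
set.seed(42)
n <- 500
sim <- data.frame(
  ethnicdensityscore = runif(n, 0, 100),
  imd_score          = runif(n, 5, 60),
  ageatdiagnosis     = rnorm(n, 40, 12)  # deliberately unrelated to the outcome
)
sim$trust.ed <- 20 + 0.5 * sim$ethnicdensityscore -
  0.3 * sim$imd_score + rnorm(n, sd = 8)

# Full model, and a reduced model with one term dropped via update()
full    <- lm(trust.ed ~ ethnicdensityscore * imd_score + ageatdiagnosis,
              data = sim)
reduced <- update(full, . ~ . - ageatdiagnosis)

# Partial F-test: does dropping ageatdiagnosis significantly worsen the fit?
comparison <- anova(reduced, full)
print(comparison)

# Adjusted R-squared and AIC side by side for both models
fit_stats <- data.frame(
  model  = c("full", "reduced"),
  adj_r2 = c(summary(full)$adj.r.squared, summary(reduced)$adj.r.squared),
  aic    = c(AIC(full), AIC(reduced))
)
print(fit_stats)
```

A large p-value in the anova() table would support keeping the reduced model; the same call could be applied to any pair of nested models such as Model 1 and Model 4.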

Diagnostic plots

Model 1

plot(linear.model.age.gender.demog, which = c(1,2))

Model 4

plot(linear.model.age.gender.demog.rm.age.gender.marital, which = c(1,2))

The diagnostic plots for the two models are very similar. Model 4, the more parsimonious of the two, is selected as the final model to predict trust ethnic density.
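
The residuals-vs-fitted and normal Q-Q plots can be backed up with a few numbers. The sketch below is illustrative only and runs on simulated data: `sim`, `fit`, `std_res`, `n_large` and `spread_trend` are hypothetical names, and the cut-off of 3 standard deviations is a common rule of thumb rather than a threshold used in this analysis.

```r
# Simulated stand-in data; none of these values come from the study.
set.seed(1)
n <- 300
sim <- data.frame(x = runif(n, 0, 100))
sim$y <- 10 + 0.4 * sim$x + rnorm(n, sd = 5)

fit <- lm(y ~ x, data = sim)

# Standardised residuals: roughly N(0, 1) if the model assumptions hold
std_res <- rstandard(fit)

# Observations more than 3 standard deviations from the fitted line
n_large <- sum(abs(std_res) > 3)

# Rough check for non-constant variance: in a well-behaved model the
# absolute residuals should not trend with the fitted values
spread_trend <- cor(abs(residuals(fit)), fitted(fit))

print(summary(std_res))
cat("Standardised residuals beyond +/-3:", n_large, "\n")
cat("Correlation of |residual| with fitted value:",
    round(spread_trend, 3), "\n")
```

Replacing `fit` with one of the fitted models above would give the same summaries for the real data.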


Output from Model 4

## 
## Call:
## lm(formula = trust.ed ~ ethnicdensityscore + LSOA_4boroughs + 
##     imd_score + ethnicdensityscore:LSOA_4boroughs + ethnicdensityscore:imd_score + 
##     LSOA_4boroughs:imd_score + ethnicdensityscore:LSOA_4boroughs:imd_score, 
##     data = subset(LSOAethnicdensity, LSOAsize > 19 & ethnicity == 
##         "White"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.483  -4.879   0.137   4.965  22.714 
## 
## Coefficients:
##                                                        Estimate Std. Error
## (Intercept)                                           63.741631   3.045672
## ethnicdensityscore                                     0.143998   0.048260
## LSOA_4boroughsCROYDON                                -29.462539   3.154551
## LSOA_4boroughsSOUTHWARK                              -30.963113   3.515554
## LSOA_4boroughsLEWSIHAM                               -25.964733   3.595507
## LSOA_4boroughsLAMBETH                                -23.829325   3.496284
## imd_score                                             -0.478910   0.075531
## ethnicdensityscore:LSOA_4boroughsCROYDON               0.471147   0.049988
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK             0.239621   0.058350
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM              0.282985   0.063519
## ethnicdensityscore:LSOA_4boroughsLAMBETH               0.101730   0.060116
## ethnicdensityscore:imd_score                           0.011334   0.001280
## LSOA_4boroughsCROYDON:imd_score                        0.420475   0.080457
## LSOA_4boroughsSOUTHWARK:imd_score                      0.357999   0.089964
## LSOA_4boroughsLEWSIHAM:imd_score                       0.231735   0.094310
## LSOA_4boroughsLAMBETH:imd_score                        0.145644   0.088125
## ethnicdensityscore:LSOA_4boroughsCROYDON:imd_score    -0.007953   0.001366
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK:imd_score  -0.001345   0.001629
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM:imd_score   -0.003688   0.001823
## ethnicdensityscore:LSOA_4boroughsLAMBETH:imd_score    -0.004738   0.001667
##                                                      t value Pr(>|t|)    
## (Intercept)                                           20.929  < 2e-16 ***
## ethnicdensityscore                                     2.984  0.00285 ** 
## LSOA_4boroughsCROYDON                                 -9.340  < 2e-16 ***
## LSOA_4boroughsSOUTHWARK                               -8.807  < 2e-16 ***
## LSOA_4boroughsLEWSIHAM                                -7.221 5.33e-13 ***
## LSOA_4boroughsLAMBETH                                 -6.816 9.65e-12 ***
## imd_score                                             -6.341 2.34e-10 ***
## ethnicdensityscore:LSOA_4boroughsCROYDON               9.425  < 2e-16 ***
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK             4.107 4.03e-05 ***
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM              4.455 8.43e-06 ***
## ethnicdensityscore:LSOA_4boroughsLAMBETH               1.692  0.09062 .  
## ethnicdensityscore:imd_score                           8.851  < 2e-16 ***
## LSOA_4boroughsCROYDON:imd_score                        5.226 1.75e-07 ***
## LSOA_4boroughsSOUTHWARK:imd_score                      3.979 6.93e-05 ***
## LSOA_4boroughsLEWSIHAM:imd_score                       2.457  0.01401 *  
## LSOA_4boroughsLAMBETH:imd_score                        1.653  0.09841 .  
## ethnicdensityscore:LSOA_4boroughsCROYDON:imd_score    -5.820 5.96e-09 ***
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK:imd_score  -0.825  0.40911    
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM:imd_score   -2.023  0.04310 *  
## ethnicdensityscore:LSOA_4boroughsLAMBETH:imd_score    -2.843  0.00448 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.925 on 20261 degrees of freedom
## Multiple R-squared:  0.7102, Adjusted R-squared:  0.7099 
## F-statistic:  2613 on 19 and 20261 DF,  p-value: < 2.2e-16

From the summary, every unit increase in ethnic density score is associated with a 0.144 unit increase in trust ethnic density in the reference borough (OTHER) at an IMD score of zero; the borough and IMD score interaction terms modify this slope elsewhere. This reflects the results from the EDA plots in the White ethnic group. Trust ethnic density is strongly positively associated with population ethnic density, which also makes intuitive sense.
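
Because Model 4 contains interactions, the slope of trust ethnic density on ethnic density score depends on borough and IMD score. The sketch below works this out by hand from the coefficients printed in the Model 4 output. The IMD score of 20 is a hypothetical value chosen for illustration, and OTHER is assumed to be the reference borough because it is the only level without its own coefficient.

```r
# Coefficients copied from the Model 4 summary above
b_eds          <-  0.143998  # ethnicdensityscore
b_eds_croydon  <-  0.471147  # ethnicdensityscore:LSOA_4boroughsCROYDON
b_eds_imd      <-  0.011334  # ethnicdensityscore:imd_score
b_eds_croy_imd <- -0.007953  # ethnicdensityscore:LSOA_4boroughsCROYDON:imd_score

imd <- 20  # hypothetical IMD score, for illustration only

# Slope in the reference borough (assumed to be OTHER)
slope_other <- b_eds + b_eds_imd * imd

# Slope in Croydon: add the borough-specific two-way and three-way terms
slope_croydon <- b_eds + b_eds_croydon + (b_eds_imd + b_eds_croy_imd) * imd

print(round(c(OTHER = slope_other, CROYDON = slope_croydon), 3))
```

At this IMD score the slope is about 0.371 in the reference borough and about 0.683 in Croydon, so the association is positive in both but clearly stronger in Croydon.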


Ratio and Ethnic Density


Can the relationship seen between ethnic density and ratio be shown statistically?

Model A

# summary(ModelA) # Multiple R-squared:  0.6626,    Adjusted R-squared:  0.6622 

Removing age as it is not significant (see Model B).

ModelB <- update(ModelA, ~. -ageatdiagnosis)

# summary(ModelB) #Multiple R-squared:  0.6626, Adjusted R-squared:  0.6622

Removing gender as it is not significant (see Model C).

ModelC <- update(ModelB, ~. -Gender_Cleaned)

# summary(ModelC) # Multiple R-squared:  0.6625,    Adjusted R-squared:  0.6622 

Removing marital status as it is not significant (see Model D).

ModelD <- update(ModelC, ~. -Marital_Cleaned)

# summary(ModelD) # Multiple R-squared:  0.6623,    Adjusted R-squared:  0.662 

Judging by R-squared alone, the performance of these models is similar. All suggest a negative association between ethnic density score and ratio (as expected from the EDA). With every unit increase in ethnic density score, the ratio decreases by 0.02. The association is very weak (OR 0.988) but it is significant.


Conclusion from investigating the association of ratio and trust ethnic density with population ethnic density


Ethnic density is a strong predictor of trust ethnic density. It also predicts, although only weakly, the ratio comparing trust ethnic density with the respective population ethnic density score.
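
To make the ratio concrete, here is a small sketch. It assumes the ratio is trust ethnic density divided by population ethnic density; that definition, the data frame `example` and all of its values are hypothetical and are not taken from the study data.

```r
# Hypothetical percentages for two areas; not taken from the study data.
example <- data.frame(
  area               = c("low own-group density", "high own-group density"),
  ethnicdensityscore = c(5, 60),  # population ethnic density (%)
  trust.ed           = c(15, 54)  # ethnic density among service users (%)
)

# Assumed definition: ratio = trust ethnic density / population ethnic density
example$ratio <- example$trust.ed / example$ethnicdensityscore
print(example)
```

Under this assumed definition, a ratio above 1 means the group makes up a larger share of service users than of local residents, and a ratio below 1 means the opposite.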


The Data Story

The project initially started by investigating the relationship between suicide and population ethnic density. To that effect, the association of suicide with ethnic density was explored in exploratory data analysis and in the final analysis. We concluded that ethnic density is not a predictor of death by suicide in this clinical cohort. This contradicts the suggested effect of ethnic density on suicide-related behaviour in a community setting, where there are indications of a protective effect. There could be several limitations to our results. In particular, we are assuming that each individual was still exposed to the same ethnic density score at the time of suicide.

During exploratory data analysis, the relationship between trust ethnic density and population ethnic density was uncovered. It turned out that an increase in ethnic density in the population was not reflected proportionately in mental health services. In fact, individuals living in areas where there were very few residents of their own ethnic group were most likely to be known to mental health services. This increase in the odds of being known to services as population ethnic density decreases was replicated across all ethnic groups. The results suggest that there could be an ethnic density effect and that the lower this effect, the higher the chances of experiencing mental health issues. Further work is required to explore this finding fully.


Data Dictionary

Outcome

Variable Name Definition/Categories
Suicide binary variable for patients who died by suicide; 0 = did not die by suicide, 1 = died by suicide

Exposure: Ethnic Density Score

Variable Name Definition/Categories
ethnicdensityscore Defined as the percentage composition of each ethnic group residing in a geographical area of a given size.

Demographics

Variable Name Definition/Categories
Gender_Cleaned Female, Male, Unknown
Marital_Cleaned Divorced / Separated / Widowed ; Married / Cohabiting ; Single ; Undisclosed
DOB_Cleaned Patient date of birth (dob)
ethnicitycleaned Patient ethnicity
ethnicity Aggregated ethnic groups: White, Other White, Black, Asian and Mixed
imd_score patient’s area level deprivation score. The higher the score, the more deprived the area
imd_quartiles fill in text here! [ref]
ageatdeath age at death (any cause of death)
ageatdiagnosis age at primary diagnosis
agegroups age groups according to age at diagnosis
LSOA_4boroughs Boroughs; CROYDON; LAMBETH; LEWSIHAM; OTHER; SOUTHWARK

Death related variables

Variable Name Definition/Categories
dateofdeath date of death for patients who died
DeathBy cause of death

Diagnosis related variables

Variable Name Definition/Categories
primary_diagnosis first diagnosis closest to the start of the observation window
diagnosisdate date of primary diagnosis
Schizophrenia_Diag binary variable to indicate if the patient has had a diagnosis of Schizophrenia disorder at some point during the observation window
SchizoAffective_Diag binary variable to indicate if the patient has had a diagnosis of Schizoaffective disorder at some point during the observation window
Depressive_Diag binary variable to indicate if the patient has had a diagnosis of Depressive disorder (mild to severe) at some point during the observation window
SubAbuse_Diag binary variable to indicate if the patient has had a diagnosis of Substance Abuse disorder at some point during the observation window
Manic_Diag binary variable to indicate if the patient has had a diagnosis of Manic disorder at some point during the observation window
Bipolar_Diag binary variable to indicate if the patient has had a diagnosis of Bipolar disorder at some point during the observation window

Overall Ethnic Density in each known LSOA variable

Variable Name Definition/Categories
LSOA11 Each patient’s area-level address code. This geographical code covers an area of ~1500 residents
TotalResidentsInLSOA The actual number of residents in the corresponding LSOA code
WhiteBrit_EDPercent The percentage ethnic density, or ethnic composition, of White British ethnic group in the corresponding LSOA code
WhiteIrish_EDPercent The percentage ethnic density, or ethnic composition, of White Irish ethnic group in the corresponding LSOA code
OtherWhite_EDPercent The percentage ethnic density, or ethnic composition, of any other White ethnic group in the corresponding LSOA code
WhiteBlackCarib_EDPercent The percentage ethnic density, or ethnic composition, of Mixed White and Black Caribbean ethnic group in the corresponding LSOA code
WhiteBlackAfri_EDPercent The percentage ethnic density, or ethnic composition, of Mixed White and Black African ethnic group in the corresponding LSOA code
WhiteAsian_EDPercent The percentage ethnic density, or ethnic composition, of Mixed White and Asian ethnic group in the corresponding LSOA code
OtherMixed_EDPercent The percentage ethnic density, or ethnic composition, of any other Mixed Race ethnic group in the corresponding LSOA code
BritIndian_EDPercent The percentage ethnic density, or ethnic composition, of the Indian ethnic group in the corresponding LSOA code
BritPakistani_EDPercent The percentage ethnic density, or ethnic composition, of the Pakistani ethnic group in the corresponding LSOA code
BritBangladeshi_EDPercent The percentage ethnic density, or ethnic composition, of the Bangladeshi ethnic group in the corresponding LSOA code
BritChinese_EDPercent The percentage ethnic density, or ethnic composition, of the Chinese ethnic group in the corresponding LSOA code
OtherAsian_EDPercent The percentage ethnic density, or ethnic composition, of any other Asian ethnic groups in the corresponding LSOA code
African_EDPercent The percentage ethnic density, or ethnic composition, of Black British or African ethnic group in the corresponding LSOA code
Caribbean_EDPercent The percentage ethnic density, or ethnic composition, of the Caribbean ethnic group in the corresponding LSOA code
OtherBlack_EDPercent The percentage ethnic density, or ethnic composition, of any other Black ethnic group in the corresponding LSOA code
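
As an illustration of how an `_EDPercent` value relates to resident counts, the sketch below reuses the Area A example from the introduction (500 Indian, 250 British and 750 African residents). The object names `area_a`, `total_residents` and `ed_percent` are hypothetical, and the calculation assumes each percentage is simply the group count divided by the total number of residents in the LSOA.

```r
# Area A from the introduction: resident counts by ethnic group
area_a <- c(Indian = 500, British = 250, African = 750)

# Total residents in the area; plays the role of TotalResidentsInLSOA
total_residents <- sum(area_a)

# Percentage ethnic density of each group (assumed: 100 * count / total)
ed_percent <- 100 * area_a / total_residents
print(round(ed_percent, 1))
```

This gives roughly 33.3% Indian, 16.7% British and 50.0% African, matching the worked example in the introduction.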

sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.5 (El Capitan)
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] parallel  grid      stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] tree_1.0-37     pROC_1.8        doMC_1.3.4      iterators_1.0.8
##  [5] foreach_1.4.3   DMwR_0.4.1      MASS_7.3-45     forestplot_1.4 
##  [9] magrittr_1.5    caret_6.0-68    lattice_0.20-33 Amelia_1.7.4   
## [13] Rcpp_0.12.5     knitr_1.13      gridExtra_2.2.1 GGally_1.0.1   
## [17] gmodels_2.16.2  ggplot2_2.1.0   dplyr_0.4.3     tidyr_0.4.1    
## [21] lubridate_1.5.6 foreign_0.8-66 
## 
## loaded via a namespace (and not attached):
##  [1] class_7.3-14       zoo_1.7-13         gtools_3.5.0      
##  [4] assertthat_0.1     digest_0.6.9       R6_2.1.2          
##  [7] plyr_1.8.3.9000    MatrixModels_0.4-1 stats4_3.3.0      
## [10] e1071_1.6-7        evaluate_0.9       highr_0.6         
## [13] gplots_3.0.1       lazyeval_0.1.10    minqa_1.2.4       
## [16] gdata_2.17.0       SparseM_1.7        car_2.1-2         
## [19] TTR_0.23-1         nloptr_1.0.4       rpart_4.1-10      
## [22] Matrix_1.2-6       rmarkdown_0.9.6    labeling_0.3      
## [25] splines_3.3.0      lme4_1.1-12        stringr_1.0.0     
## [28] munsell_0.4.3      compiler_3.3.0     mgcv_1.8-12       
## [31] htmltools_0.3.5    nnet_7.3-12        codetools_0.2-14  
## [34] reshape_0.8.5      bitops_1.0-6       nlme_3.1-128      
## [37] gtable_0.2.0       DBI_0.4-1          formatR_1.4       
## [40] scales_0.4.0       KernSmooth_2.23-15 quantmod_0.4-5    
## [43] stringi_1.0-1      ROCR_1.0-7         reshape2_1.4.1    
## [46] xts_0.9-7          tools_3.3.0        abind_1.4-3       
## [49] pbkrtest_0.4-6     yaml_2.1.13        colorspace_1.2-6  
## [52] caTools_1.17.1     quantreg_5.24